Summer School of Interdisciplinary Research on Brain Network Dynamics, June 24-28, 2019

# Analog Crossbar Arrays – Future Neuromorphic Workhorses for Neural Networks

**Tutorial**

Roger Dangel, PhD
Neuromorphic Devices and Systems Group
IBM Zurich GmbH

### IBM Zurich: Neuromorphic Devices & Systems Group

- **Group leader:** Bert Offrein
- **Post-Docs**
- **Research Staff**
- **Pre-Docs**

### Outlook

Introduction:

- Experiment: Human brain against computer
- Conceptual comparison: Brain vs. computer
- What makes the human brain so outstanding? Can we mimic it?

Part 1:

- Neuromorphic computing: what is it?
- Neuromorphic tasks in AI
- Anatomy of computationally "heavy" workloads: the actual problem
- Computational challenge: Matrix-vector multiplications
- Current and future AI acceleration hardware

Part 2:

- From brain-like to Deep Neural Networks (DNNs)
- Training of DNNs with the backpropagation algorithm: mathematical background
- Status of today's Deep Neural Network processing

Part 3:

- Analog electrical crossbar array vs. DNN
- Synaptic weight processing operations
- Targeted device properties for analog electrical crossbar arrays

Part 4:

- Memristive devices for synaptic weight implementation
- Examples: Resistive Random Access Memory (ReRAM), Phase Change Memory (PCM), Ferroelectric Tunneling Junctions (FTJ)

Summary

### **Keywords of this Tutorial**

- Human brain
- (Non) von Neumann architecture
- Neuromorphic computing
- Analog vs. digital processing
- Deep Neural Networks
- Backpropagation algorithm
- Matrix-vector multiplication
- CPU / GPU / FPGA / ASIC
- Training / Inference
- Accelerators
- Memristive devices
- Synaptic weights
- Non-volatile memory
- Crossbar arrays
- PCM / ReRAM / FTJ
- Multiply & accumulate

### Task 1: Mathematics

$$\sqrt{2} = ?$$

- **Brain:** $\sqrt{2} \approx 1.4142$
- **Computer:** result obtained in << 1 sec: 1.414213562373095048801688724209698078569671875376948073176679737990732478462107038850387534327641572735013846230912297024924836055850737212644121497099935831413222665927505592755799950501152782060571470109559971605970274534596862014728517418640889198609552329230484308714321450839762603627995251407989687253396546331808829640620615258352395054745750287759961729835575220337531857011354374603408498847160386899970699004815030544027790316454247823068492936918621580578463311596668713013015618568987237235288509264861249497715421833420428568

### Task 2: Image recognition

What does the image show?

- **Brain:** result obtained in < 1 sec: group of ring-tailed lemurs eating fruits
- **Computer:** not able to solve this task (yet)

### What makes the Human Brain so Outstanding?
- Power efficiency: the human brain runs on ≈ 20 watts, whereas supercomputers consume up to megawatts
- The brain recognizes patterns and images and can deduce facts from raw (noisy) data

#### Brain at neural network level

Source: http://www.sciencephoto.com/dennis-kunkel-microscopy-collection

- <u>Human</u> brain: ≈ 100 billion nerve cells (= neurons)
- Each neuron receives signals from 1,000 to 10,000 other neurons via synapses → massive connectivity
- Signals transmitted by synapses are adjustable → "synaptic weight"

Source: http://biomedicalengineering.yolasite.com/neurons.php

- Signaling between neurons: spikes, spike trains
- Neuron activation: "Integrate and Fire"
- Learning: adjustment of the synaptic weights
- Spike Timing Dependent Plasticity: "Neurons that fire together wire together"

### Conceptual Comparison: "Human Brain vs. Computer"

Different (complementary) abilities:

**Human brain**

- Nerve cells (neurons) are the processing units
- Analog operation
- Distributed processor and memory
- Massively parallel processing
- Slow information processing
- Redundancy and fault tolerance
- ...

**Today's computer**

- Transistors are the processing units
- Digital operation
- Centralized processor and memory
- (Mostly) serial processing
- Very fast information processing
- Reliable and precise
- ...

Question: Can we mimic the human brain to exploit its superiority in certain applications?

### Outlook

- Part 1: Neuromorphic computing: what is it?
- Neuromorphic tasks in AI
- Anatomy of computationally "heavy" workloads: the actual problem
- Computational challenge: Matrix-vector multiplications
- Current and future AI acceleration hardware

### Neuromorphic Computing – What is it?

**Etymology:**

- "neuro" $\Leftrightarrow$ related to nerves or the nervous system
- "morphic" $\Leftrightarrow$ having the form or structure of...
**Definition:** Neuromorphic computing is a **brain-inspired signal processing** technology that tries to mimic the neuro-biological architecture of the brain and its functions.

As an interdisciplinary technology, it combines concepts from

- biology,
- physics,
- mathematics,
- computer science,
- and electronic engineering

to design and realize new artificial neural network systems.

Source: http://www.web3.lu/category/science-philosophy/

### **Neuromorphic Tasks in AI**

Neuromorphic challenges in AI are tasks that normally require human "intellect", e.g.:

- Memorizing complex information
- Deducing facts from raw (unstructured) data
- Making recommendations and decisions in the presence of uncertainty and ambiguity

### **Anatomy of "Heavy" Computational Workloads – The Actual Problem**

#### **Scientific Workloads**

- (Electro)chemistry
- Drug discovery

[Figure: system of coupled second-order partial differential equations (PDEs) in $u_1, u_2, \dots$ with terms $f_1, f_2, f_3$; partially obscured in the source.]

#### Backpropagation

$$x_{i,j}^{l} = \sum_{m} \sum_{n} w_{m,n}^{l} \, o_{i+m,j+n}^{l-1} + b_{i,j}^{l}$$

$$o_{i,j}^{l} = f\left(x_{i,j}^{l}\right)$$

$$\frac{\partial E}{\partial x_{i,j}^{l}} = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} \delta_{i-m,j-n}^{l+1} \, w_{m,n}^{l+1} \, f'\left(x_{i,j}^{l}\right)$$

$$\frac{\partial E}{\partial w_{m,n}^{l}} = \sum_{i} \sum_{j} \delta_{i,j}^{l} \, o_{i+m,j+n}^{l-1}$$

#### AI / Machine Learning

- Time-series predictions
- Graph analytics
- Weather/Climate
- Clustering algorithms
- Image classification

### Computational Challenge: Matrix-Vector Multiplications

Matrix-vector multiplications of the form

$$\boldsymbol{W}\boldsymbol{x} = \begin{bmatrix} w_{0,0} & w_{0,1} & w_{0,2} & \cdots & w_{0,N} \\ w_{1,0} & w_{1,1} & w_{1,2} & \cdots & w_{1,N} \\ w_{2,0} & w_{2,1} & w_{2,2} & \cdots & w_{2,N} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ w_{M,0} & w_{M,1} & w_{M,2} & \cdots & w_{M,N} \end{bmatrix} \cdot \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_N \end{bmatrix} = \begin{bmatrix} \sum_{i=0}^{N} w_{0,i} x_i \\ \sum_{i=0}^{N} w_{1,i} x_i \\ \sum_{i=0}^{N} w_{2,i} x_i \\ \vdots \\ \sum_{i=0}^{N} w_{M,i} x_i \end{bmatrix}$$

are common to the mentioned workloads and
dominate the computation time and energy consumption.

Matrix-vector multiplications are "computationally expensive"!

→ Develop dedicated hardware (Analog Crossbar Arrays) that enables **efficient analog implementation of matrix-vector multiplications** and therefore accelerates Deep Neural Network learning.

### **Current and Future AI Acceleration Hardware**

[Figure axis: Flexibility]

#### Central Processing Unit (CPU)

- CPUs were originally designed for general computing workloads

#### Graphics Processing Unit (GPU)

**Status: current workhorse**

- GPUs operate on vectors of data in parallel
- GPUs are effective at processing the same set of operations in parallel (single instruction, multiple data, SIMD)
- GPUs have a well-defined instruction set and fixed data width

#### Field-Programmable Gate Array (FPGA)

**Status: limited spread**

- FPGAs contain an array of programmable logic blocks and a hierarchy of reconfigurable interconnects
- FPGAs are reconfigurable, which makes the evolution of hardware, frameworks, and software easier
- FPGAs are effective at processing the same or different sets of operations in parallel (multiple instructions, multiple data, MIMD)
- FPGAs do not have a predefined instruction set or fixed data width

#### Application-Specific Integrated Circuit (ASIC)

**Status: under investigation**

- ASICs are hardware designed for one specific application
- ASICs employ special strategies, e.g. optimized memory use or lower-precision arithmetic

#### Resistive Processing Unit (RPU) based on Analog Crossbar Arrays

### Outlook

- Part 2: From brain-like to Deep Neural Networks (DNNs)
- Training of DNNs with the backpropagation algorithm: mathematical background
- Status of today's Deep Neural Network processing

### From Brain to Brain-like Neural Network

#### **Brain-like neural network**

- Omni-directional signal flow
- Asynchronous pulse signals
- Information encoded in signal timing
- Difficult to implement efficiently on standard computer hardware

Simplified model → **Artificial Neural Network**

#### **Artificial Neural Network (ANN)**

- ANNs are neuromorphic computing models that mimic the brain in a simplified way.
- ANNs are composed of multiple nodes (= artificial neurons), which can be arranged in special configurations.
- The first developed, simplest, and most common ANN is the **feed-forward Deep Neural Network (DNN)**.

#### **DNN** → better fit to standard hardware

- Feed-forward sequential processing
- Information encoded in signal amplitude
- Neuron activation: weighted sum + threshold
- Training with the "backpropagation algorithm"

### Operation Phases of Deep Neural Network (DNN)

[Figure: Phase 1 and Phase 2]

Computational speed and efficiency are extremely important because **training** of Deep Neural Networks **can take days to weeks** (even with high-performance computers)!
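The DNN properties listed above (feed-forward sequential processing, neuron activation as weighted sum plus threshold) can be illustrated with a minimal sketch. It is not taken from the tutorial: the function names, the made-up weights, and the choice of a sigmoid as a smooth threshold are all assumptions for illustration.

```python
import math


def sigmoid(z):
    """Smooth threshold: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(weights, inputs):
    """One layer: per neuron, a weighted sum of the inputs, then the activation.

    weights[j][i] is the (hypothetical) synaptic weight from input i to neuron j.
    """
    outputs = []
    for row in weights:
        weighted_sum = sum(w * x for w, x in zip(row, inputs))
        outputs.append(sigmoid(weighted_sum))
    return outputs


def network_forward(layers, inputs):
    """Feed-forward processing: each layer's output is the next layer's input."""
    signal = inputs
    for weights in layers:
        signal = layer_forward(weights, signal)
    return signal


# Hypothetical 3-2-1 network with made-up weights.
layers = [
    [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]],  # layer 1: 3 inputs -> 2 neurons
    [[1.0, -1.0]],                          # layer 2: 2 inputs -> 1 neuron
]
output = network_forward(layers, [1.0, 0.0, 1.0])
```

Each call to `layer_forward` is one matrix-vector multiplication followed by a per-element non-linearity, which is why this operation dominates the cost of running and training such networks.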
### Generic scheme for iterative error minimization by adjusting the synaptic weights

#### **Neural net as chain of vector operations**

**Components:**

- Layers of neurons
- Synaptic interconnections

**Mathematical operations:**

- Signal vector: $x$
- Synaptic weight matrix: $[W_n]$
- Per-element neural (non-linear) activation function (sigmoid): $\sigma$

#### For many training case inputs $X$ with target response $Y_{\text{target}}$:

1. **Forward propagate:** compute the response to $X$ and store the neuron activation patterns $X_i$ for later use.

   $$x \xrightarrow{W_1} \sigma \xrightarrow{W_2} \sigma \xrightarrow{W_3} \sigma \rightarrow y$$

2. **Determine output error** $\varepsilon$:

   $$\varepsilon = \frac{1}{2} \sum_{i} \left[y_i - y_{\text{target},i}\right]^2$$

3. **Backward propagate:** which neuron inputs have the strongest influence on $\varepsilon$? → Error gradient vectors $\delta_i$

   $$x \xleftarrow[\delta_1]{W_1^T} \sigma \xleftarrow[\delta_2]{W_2^T} \sigma \xleftarrow[\delta_3]{W_3^T} \sigma \leftarrow \varepsilon$$

4. **Adjust weights** that were active ($\propto x$), proportionally to their influence on the error $\varepsilon$ ($\propto \delta$):

   $$\Delta W = -\eta \, x \otimes \delta$$

   $$\Delta w_{ij} = -\eta \, x_i \delta_j$$

### **Status of Today's Deep Neural Network Processing**

Processing is dominated by large matrix operations:

- Forward propagation, backward propagation, and weight update each scale $\propto N^2$ ($N$ = neurons per layer)
- Large training datasets: thousands of training cases

Inefficient on standard von Neumann architecture systems:

- (Mostly) serial processing
- Low computation-to-I/O ratio → memory bottleneck

[Figure: **high-performance computer**. Current situation: today's standard computer architecture (proposed by John von Neumann in 1945).]

**Need for faster and more efficient DNN processing**

#### **Borrow some concepts from the brain:**

- Analog signal processing
- Fully parallel processing
- Tight integration of processing and memory

→ **Analog Crossbar Arrays**

### Outlook
- Part 3: Analog electrical crossbar array vs. DNN
- Synaptic weight processing operations

### **Analog Electrical Crossbar Array**

### **Synaptic Weight Processing Operations**

Legend:

- $x$: input vector
- $[W]$: weight matrix
- $[W]^T$: transposed weight matrix

#### Synaptic weight update

**Challenge:**

- The update must be proportional to the signals on the rows ($\propto x_i$) and on the columns ($\propto \delta_j$)
- **Symmetric** increase and decrease of the weight
- Analog behavior: > 100 levels preferred (approx. 8 bits)

### Outlook

- Part 4:
- Targeted device properties for analog electrical crossbar arrays
- Memristive devices for synaptic weight implementation
- Examples: Resistive Random Access Memory (ReRAM), Phase Change Memory (PCM), Ferroelectric Tunneling Junctions (FTJ)

### Targeted Device Properties for Analog Electrical Crossbar Arrays

#### **Our Dream Device:**

- CMOS compatibility
- Low-voltage operation
- Small device footprint
- Very short (re-)set time
- Long retention time (↔ non-volatile memory, NVM)
- Low drift
- High dynamic range
- Large resistance range (high resistance → low power)
- Reproducibility, low variability
- (Some) linearity & symmetry

Gokmen & Vlasov, Acceleration of Deep NN Training..., Frontiers in Neuroscience, 2016

#### **Operation:**

[Figure: "programming" of the resistance (representative, generic characteristic); axes: resistance vs. number of pulses.]

#### **Programming Scheme:**

Pulse encoding (incremental) → linear & symmetric

### Memristive Devices for Synaptic Weight Implementation

- **ReRAM**
- **PCM**
- **FTJ**

Others:

- **FeRAM** (<u>Fe</u>rroelectric RAM)
- **MRAM** (<u>M</u>agnetic <u>RAM</u>)
- **ECRAM** (<u>E</u>lectro-<u>C</u>hemical RAM)

### Resistive Random Access Memory (ReRAM)

- ReRAM (also called RRAM) is one type of memristive non-volatile memory; it works by changing the resistance across a dielectric solid-state material
- A sufficiently high voltage $V_{\text{forming}}$ makes the insulating dielectric material conductive
- Filament-like or homogeneous current conduction path(s) are induced by defects (oxygen vacancies)
- Switching between the Low Resistance State (LRS) and the High Resistance State (HRS) is done by applying suitable voltages $-V_{\text{reset}}$ and $+V_{\text{set}}$
- The oxygen vacancies act as charge carriers, meaning that the depleted area has a much lower resistance

#### ReRAM phases:

- **FORMING:** creation of a conducting filament in the dielectric material between the electrodes
- **RESET** (LRS → HRS): partial dissolution of the filament
- **SET** (HRS → LRS): recreation of the filament
- **STORAGE:** retain the last resistance

### Resistive Random Access Memory (ReRAM)

#### Challenge:

With only one (or a few) localized conductive filaments, switching would be quite abrupt (between 2 resistance states: LRS and HRS).

[Figure: ReRAM cell with metal top electrode, oxide layer with conductive filament, and bottom electrode.]

However, the use of ReRAM in analog crossbar arrays requires gradual tuning of the resistance with many intermediate states. Approaches:

- Use of specifically **engineered oxides with suitable oxygen intercalation**<sup>(\*)</sup> **properties** as electrodes
- Volumetric changes of the conductive filament(s) (i.e., in the lateral dimension)

<sup>(\*)</sup> Intercalation: In chemistry, intercalation is the reversible inclusion or insertion of molecules (or ions) into materials... (Wikipedia)

### Phase Change Memory (PCM)

- PCM (also called PCRAM) is another memristive non-volatile memory
- PCM materials exhibit an amorphous and a crystalline phase
- Rapid and repeated switching between the two phases is possible
- Switching is typically induced by optical or electrical heating
- Physical properties vary significantly between the phases:
  - crystalline phase $\rightarrow$ Low Resistance State (LRS)
  - amorphous phase $\rightarrow$ High Resistance State (HRS)
  - Ratio of electrical resistances $R_{LRS} : R_{HRS} = 1:100$ to $1:1000$
- Many phase change materials are chalcogenides; the most studied and utilized is Ge<sub>2</sub>Sb<sub>2</sub>Te<sub>5</sub> (GST)

Hegedüs, J. & Elliott, S. R., *Nature Mater.* **7**, 399–405 (2008).
[Figure: PCM cell; labels: top electrode, crystalline, insulator.]

#### Two-level-cell PCM

- Only two states
- Commercially available as Storage Class Memory (SCM)

#### Multi-level-cell PCM

- Many intermediate states
- Under development for emerging analog crossbar arrays

### Ferroelectric Tunneling Junction (FTJ)

- Ferroelectric materials are dielectrics that exhibit a macroscopic electrical polarization $P$, even in the absence of an external electric field ($E = 0 \rightarrow P = \pm P_{\text{remanent}}$). By applying an electric field, the macroscopic polarization state of a ferroelectric material can be gradually tuned (→ ferroelectric hysteresis curve), because individual domains in the material switch their polarization from one direction to the opposite one or vice versa.
- A Ferroelectric Tunneling Junction is based on a ferroelectric barrier layer, a few nm thick, sandwiched between two different electrodes (typically metal / semiconductor). Gradual tuning of the polarization state is possible by applying suitable positive or negative voltage pulses across the FTJ.

[Figures: ferroelectric material with boundary conditions; ferroelectric hysteresis curve; ferroelectric tunneling junction.]

#### Material example: BaTiO<sub>3</sub>

- Cubic phase of BaTiO<sub>3</sub>
- Perovskite crystal
- Off-center position of Ti<sup>4+</sup> → ferroelectric behavior

J. P. Velev et al., "Predictive modelling of ferroelectric tunnel junctions", npj Computational Materials, vol. 2, Article no: 16009 (2016)

### Ferroelectric Tunneling Junction (FTJ)

- **Gradual polarization state tuning** can be achieved by applying suitable positive or negative voltage pulses across the FTJ.
- Many intermediate polarization states can be induced by **nucleation and growth of domains with opposite polarization.**
- The electrical current through the FTJ varies with the macroscopic polarization state because of the **different tunnel widths for the two opposite polarization states** in individual domains.
- The electrical resistance of the FTJ can be tuned by the polarization state.
$\rightarrow$ "Tunneling Electro-Resistance" (TER) with up to $10^4\times$ variation.

**The FTJ retains its last resistance value** when power is turned off.

[Figure: dependence of tunneling current and resistance on polarization.]

### **Summary**

For the learning ("training") and use ("inference") of Artificial Neural Networks, digital (co-)processors (CPUs, GPUs, FPGAs, and ASICs) in computer systems based on the von Neumann architecture are used almost exclusively today.

One promising alternative to these energy-hungry, digital-logic-based computer systems is Analog Neuromorphic Computing, in which computationally time-consuming and therefore expensive operations are performed by specialized accelerators built from analog elements, with the promise of improving performance and power efficiency by factors of 1,000 to 10,000.

In general, suitable compute elements are programmable analog devices with non-volatile memory capabilities that can be arranged in crossbar arrays to perform various mathematical operations.

The main requirements for such emerging "non von Neumann" architectures are vector-matrix multiplication, the ability to provide the transposed matrix for learning, and a means to store analog synaptic weights. Because the weights are stored where the computation takes place, this mitigates the huge communication overhead for the operands in traditional systems, i.e. it avoids the time- and energy-consuming massive data shuffling between processor and memory.
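The three requirements named in the summary (vector-matrix multiplication, use of the transposed matrix, and stored weights that are updated in place) can be sketched in software. This is a purely illustrative digital model, not the tutorial's implementation: the class name `CrossbarSketch`, the row/column convention, and the assumption of perfectly linear, noise-free devices are all hypothetical. In an analog array, each device would perform the multiply and the shared wire the accumulate; plain Python loops stand in for that here.

```python
class CrossbarSketch:
    """Idealized crossbar: weights[i][j] sits where row i crosses column j.

    Assumption: every device is perfectly linear and noise-free, which real
    memristive devices are not.
    """

    def __init__(self, weights):
        self.weights = [list(row) for row in weights]

    def forward(self, x):
        """Drive the rows with x and read one accumulated sum per column."""
        n_cols = len(self.weights[0])
        return [
            sum(self.weights[i][j] * x[i] for i in range(len(x)))
            for j in range(n_cols)
        ]

    def backward(self, delta):
        """Drive the columns with delta and read one sum per row.

        The same stored weights are used in the transposed direction.
        """
        return [sum(w * d for w, d in zip(row, delta)) for row in self.weights]

    def update(self, x, delta, eta):
        """In-place outer-product update: dw_ij = -eta * x_i * delta_j."""
        for i, x_i in enumerate(x):
            for j, d_j in enumerate(delta):
                self.weights[i][j] -= eta * x_i * d_j


xbar = CrossbarSketch([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # 3 rows, 2 columns
y = xbar.forward([1.0, 0.0, -1.0])   # column sums
back = xbar.backward([1.0, 1.0])     # row sums through the same array
xbar.update([1.0, 0.0, -1.0], [0.5, 0.0], eta=0.1)
```

Note that no weight ever leaves the `weights` array: all three operations read or modify the values where they are stored, which is the property the summary contrasts with data shuffling between processor and memory.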